Big Data is data that is too large for traditional tools to handle. Big Data usually has the 3Vs:
Extracted from: https://financesonline.com/what-is-big-data-analytics-and-how-it-helps-you-understand-your-customers/
Volume: the data is too large to store or process on a single machine.
Variety: the data comes in many types (structured, semi-structured, unstructured).
Velocity: the data arrives and grows very fast.
When data is too large, we use MapReduce. As I said while teaching at Kaplan and the University of Portsmouth, it is common sense: when one computer cannot process the data, we use many computers to process it.
Extracted from: https://www.edureka.co/blog/mapreduce-tutorial/
When the data is large, we split it into small chunks. We send the chunks to the mapper nodes; each node is a computer.
The mappers do the per-record processing of the data.
The reducer shuffles the mapper outputs, aggregates them, and sends the result to the output.
Note: the heavy per-record work is done in parallel on the mappers; the reducer only combines (aggregates) their results.
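The split → map → shuffle → reduce flow above can be sketched in plain Python as a single-machine simulation (in a real job, each mapper would run on a separate node):

```python
from collections import defaultdict

def mapper(chunk):
    # Map phase: each mapper node turns its chunk of text into (word, 1) pairs.
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    # Shuffle phase: group all pairs by key so each reducer sees one word's values.
    groups = defaultdict(list)
    for pairs in mapped_pairs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: aggregate all the values collected for one key.
    return key, sum(values)

# Split the "large" data into small chunks, one per mapper node.
chunks = ["big data is big", "data is everywhere"]
mapped = [mapper(chunk) for chunk in chunks]
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The word count here is the classic MapReduce example: the mappers never see each other's chunks, which is what lets a real cluster process each chunk on a different machine.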
Extracted from: https://data-flair.training/blogs/hadoop-ecosystem-components/
Hive is for SQL queries.
Mahout is for Machine Learning.
HBase is a columnar store.
HDFS is where the data is stored.
Apache Spark is another very popular Big Data system. Apache Spark is faster (it keeps data in memory) and has Machine Learning libraries (MLlib).
Apache Spark can run inside Hadoop (on YARN):
Extracted from: https://www.edureka.co/blog/hadoop-ecosystem
Go to Databricks.com and click Login.
Click Sign in here.
Select Import and Export Data.
Upload iris.csv and click Create Table in Notebook.
Cmd 2: change it to something like the example shown (screenshot not reproduced here).
Cmd 3: change it to something like the example shown.
Cmd 4: change it to something like the example shown.
Click the + button to add more cells.
Cmd 5: change it to something like the example shown.
Cmd 6: change it to something like the example shown.
Cmd 7: change it to something like the example shown.
Go to Cmd 2 and choose Run All Below.